Tree-Based Algorithms (Decision Trees, Random Forests, Gradient Boosting)#
Tree models are some of the most practical “workhorse” algorithms in machine learning:
Decision trees are interpretable if-else rules learned from data.
Random forests reduce variance by averaging many trees.
Gradient-boosted trees reduce bias by adding trees sequentially to fix mistakes.
A simple way to remember the core idea:
A tree is like playing 20 Questions with your data.
Each question splits the world into two parts. You want questions that quickly reduce uncertainty.
This notebook builds intuition with visual, Plotly-based demos, then implements simplified versions from scratch in NumPy, and finally connects everything to scikit-learn (and the modern boosting libraries).
Learning goals#
By the end you should be able to:
explain how decision trees work for classification and regression
use and compare Gini impurity vs entropy / information gain
implement a small CART-style tree from scratch (NumPy)
explain and implement bagging and a simplified random forest
understand what gradient boosting is doing (and how it relates to XGBoost / LightGBM / CatBoost)
Notation#
Dataset: \((x_i, y_i)\) for \(i = 1, \dots, n\)
Feature matrix: \(X \in \mathbb{R}^{n \times d}\)
Targets:
classification: \(y \in \{1,\dots,K\}\)
regression: \(y \in \mathbb{R}\)
Table of contents#
Decision trees: intuition
Impurity measures (Gini, entropy, variance)
A split in slow motion (Plotly)
Decision trees from scratch (classification + regression)
scikit-learn trees + key hyperparameters
Ensembles: bagging and random forests (from scratch + sklearn)
Gradient boosting and modern libraries (XGBoost / LightGBM / CatBoost)
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
import os
import plotly.io as pio
from dataclasses import dataclass
from sklearn.datasets import make_moons
from sklearn.metrics import accuracy_score, mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor, export_text
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import log_loss
pio.templates.default = "plotly_white"
pio.renderers.default = os.environ.get("PLOTLY_RENDERER", "notebook")
np.set_printoptions(precision=4, suppress=True)
rng = np.random.default_rng(42)
1) Decision trees: intuition#
A decision tree is a sequence of questions.
Example (classification):
“Is x1 ≤ 0.7?”
yes → “Is x2 ≤ -0.2?” → class A
no → class B
Example (regression):
“Is area ≤ 80 m²?” → predict $220k
else “Is area ≤ 120 m²?” → predict $310k
else → predict $420k
A tree doesn’t learn a smooth curve. It learns a set of regions, and predicts a constant value per region.
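The house-price example above, spelled out as literal Python if/else rules (the thresholds and prices are the illustrative numbers from the text, not learned from data):

```python
def predict_price_k(area_m2: float) -> int:
    """Piecewise-constant prediction: one constant value per region (in $k)."""
    if area_m2 <= 80:
        return 220
    if area_m2 <= 120:
        return 310
    return 420

print(predict_price_k(70), predict_price_k(100), predict_price_k(150))  # 220 310 420
```

Fitting a tree amounts to learning these thresholds and constants from data, which is what the rest of the notebook does.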
2) Impurity measures (how to pick a good question)#
At any node, the data is a “bag” of samples.
A good split makes the children bags purer.
Classification#
Let \(p_k\) be the fraction of class \(k\) in the node.
Gini impurity:

\[
G = 1 - \sum_{k=1}^{K} p_k^2
\]

Entropy:

\[
H = -\sum_{k=1}^{K} p_k \log_2 p_k
\]

Both measure “messiness”. Both are 0 when the node is perfectly pure.

A common split score is information gain:

\[
\text{Gain} = I(\text{parent}) - \frac{n_\text{left}}{n}\, I(\text{left}) - \frac{n_\text{right}}{n}\, I(\text{right})
\]

where \(I\) is either Gini or entropy.
Regression#
A common impurity is variance (equivalently MSE around the mean):

\[
\text{Var} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \bar{y})^2
\]
Splits are chosen to reduce the weighted variance.
def class_counts(y: np.ndarray) -> np.ndarray:
y = np.asarray(y)
if y.size == 0:
return np.array([], dtype=int)
_, counts = np.unique(y, return_counts=True)
return counts
def gini_impurity(y: np.ndarray) -> float:
counts = class_counts(y)
if counts.size == 0:
return 0.0
p = counts / counts.sum()
return float(1.0 - np.sum(p ** 2))
def entropy(y: np.ndarray) -> float:
counts = class_counts(y)
if counts.size == 0:
return 0.0
p = counts / counts.sum()
p = p[p > 0]
return float(-np.sum(p * np.log2(p)))
def variance(y: np.ndarray) -> float:
y = np.asarray(y, dtype=float)
if y.size == 0:
return 0.0
return float(np.mean((y - y.mean()) ** 2))
def information_gain(y: np.ndarray, y_left: np.ndarray, y_right: np.ndarray, impurity_fn) -> float:
n = y.size
if n == 0:
return 0.0
n_left = y_left.size
n_right = y_right.size
if n_left == 0 or n_right == 0:
return 0.0
parent = impurity_fn(y)
child = (n_left / n) * impurity_fn(y_left) + (n_right / n) * impurity_fn(y_right)
return float(parent - child)
# Quick sanity check
print('gini([A,A,B,B]) =', gini_impurity(np.array([0, 0, 1, 1])))
print('entropy([A,A,B,B]) =', entropy(np.array([0, 0, 1, 1])))
gini([A,A,B,B]) = 0.5
entropy([A,A,B,B]) = 1.0
3) A split in slow motion (1D classification)#
We’ll create a 1D dataset and evaluate “How good is the split at threshold \(t\)?”
Think of the threshold as a question:
“Is \(x \le t\)?”
We’ll plot information gain for both Gini and entropy.
# 1D binary classification dataset
n = 140
x_1d = np.concatenate([
rng.normal(loc=-1.2, scale=0.7, size=n // 2),
rng.normal(loc=1.0, scale=0.8, size=n // 2),
])
y_1d = np.array([0] * (n // 2) + [1] * (n // 2))
# Shuffle
perm = rng.permutation(n)
x_1d = x_1d[perm]
y_1d = y_1d[perm]
fig = px.scatter(
x=x_1d,
y=y_1d,
title="1D classification dataset (points are classes)",
labels={"x": "x", "y": "class"},
)
fig.update_yaxes(tickmode="array", tickvals=[0, 1])
fig.show()
# Candidate thresholds: midpoints between sorted unique x values
x_sorted = np.sort(np.unique(x_1d))
thresholds = (x_sorted[:-1] + x_sorted[1:]) / 2.0
ig_gini = []
ig_entropy = []
for t in thresholds:
left = x_1d <= t
y_left = y_1d[left]
y_right = y_1d[~left]
ig_gini.append(information_gain(y_1d, y_left, y_right, gini_impurity))
ig_entropy.append(information_gain(y_1d, y_left, y_right, entropy))
ig_gini = np.array(ig_gini)
ig_entropy = np.array(ig_entropy)
best_t_gini = float(thresholds[np.argmax(ig_gini)])
best_t_entropy = float(thresholds[np.argmax(ig_entropy)])
best_t_gini, best_t_entropy
(-0.3957878668246275, -0.3957878668246275)
fig = go.Figure()
fig.add_trace(go.Scatter(x=thresholds, y=ig_gini, mode="lines", name="Information gain (Gini)"))
fig.add_trace(go.Scatter(x=thresholds, y=ig_entropy, mode="lines", name="Information gain (Entropy)"))
fig.add_vline(x=best_t_gini, line_dash="dash", line_color="blue", annotation_text="best (Gini)")
fig.add_vline(x=best_t_entropy, line_dash="dash", line_color="orange", annotation_text="best (Entropy)")
fig.update_layout(
title="How good is a split? Information gain vs threshold",
xaxis_title="threshold t",
yaxis_title="information gain",
)
fig.show()
What did the “best split” actually do?#
Let’s look at the before/after impurity numbers at the best Gini threshold.
This makes the abstract formula feel more concrete:
parent node: one mixed bag
left/right nodes: two less-mixed bags
# Inspect impurity before/after the best split (Gini)
left = x_1d <= best_t_gini
g_parent = gini_impurity(y_1d)
g_left = gini_impurity(y_1d[left])
g_right = gini_impurity(y_1d[~left])
gain = information_gain(y_1d, y_1d[left], y_1d[~left], gini_impurity)
counts_parent = np.bincount(y_1d, minlength=2)
counts_left = np.bincount(y_1d[left], minlength=2)
counts_right = np.bincount(y_1d[~left], minlength=2)
print('counts parent:', counts_parent, 'gini:', round(g_parent, 4))
print('counts left :', counts_left, 'gini:', round(g_left, 4))
print('counts right :', counts_right, 'gini:', round(g_right, 4))
print('information gain:', round(gain, 4))
counts parent: [70 70] gini: 0.5
counts left : [67 0] gini: 0.0
counts right : [ 3 70] gini: 0.0788
information gain: 0.4589
# Visualize the split on the 1D data
jitter = rng.normal(0, 0.03, size=y_1d.size)
fig = go.Figure()
fig.add_trace(go.Scatter(x=x_1d, y=y_1d + jitter, mode='markers', name='data'))
fig.add_vline(x=best_t_gini, line_dash='dash', line_color='black', annotation_text='best split')
fig.update_layout(
title='The best split threshold on the 1D dataset',
xaxis_title='x',
yaxis_title='class (jittered)'
)
fig.show()
4) Decision trees from scratch (NumPy)#
Below we implement a simplified, educational version of a CART-style tree:
continuous features
binary splits of the form \(x_j \le t\)
greedy choice: pick the best (feature, threshold) at each node
This is not optimized for speed; it’s meant to be readable.
4.1 Decision tree classifier (Gini or entropy)#
Stopping rules (common “pre-pruning” controls):
max_depth
min_samples_split
min_samples_leaf
Random forest also needs max_features: only consider a random subset of features at each split.
@dataclass
class Node:
feature_index: int | None = None
threshold: float | None = None
left: "Node | None" = None
right: "Node | None" = None
value: object | None = None # predicted class (classifier) or mean value (regressor)
proba: np.ndarray | None = None # class probabilities for classifier
def _resolve_max_features(max_features, n_features: int) -> int:
if max_features is None:
return n_features
if isinstance(max_features, int):
return max(1, min(n_features, max_features))
if isinstance(max_features, float):
return max(1, min(n_features, int(np.ceil(max_features * n_features))))
if isinstance(max_features, str):
key = max_features.lower()
if key == "sqrt":
return max(1, int(np.sqrt(n_features)))
if key == "log2":
return max(1, int(np.log2(n_features)))
raise ValueError(f"Unsupported max_features={max_features!r}")
class DecisionTreeClassifierScratch:
def __init__(
self,
*,
criterion: str = "gini",
max_depth: int | None = None,
min_samples_split: int = 2,
min_samples_leaf: int = 1,
max_features=None,
random_state: int = 0,
):
self.criterion = criterion
self.max_depth = max_depth
self.min_samples_split = min_samples_split
self.min_samples_leaf = min_samples_leaf
self.max_features = max_features
self.random_state = random_state
self.root_: Node | None = None
self.classes_: np.ndarray | None = None
self._rng = np.random.default_rng(random_state)
def _impurity(self, y: np.ndarray) -> float:
if self.criterion == "gini":
return gini_impurity(y)
if self.criterion in {"entropy", "log_loss"}:
return entropy(y)
raise ValueError(f"Unknown criterion: {self.criterion}")
def _best_split(self, X: np.ndarray, y: np.ndarray) -> tuple[int | None, float | None, float]:
n_samples, n_features = X.shape
if n_samples < self.min_samples_split:
return None, None, 0.0
parent_impurity = self._impurity(y)
if parent_impurity == 0.0:
return None, None, 0.0
k = _resolve_max_features(self.max_features, n_features)
feature_indices = self._rng.choice(n_features, size=k, replace=False)
best_gain = 0.0
best_feature = None
best_threshold = None
for feature_index in feature_indices:
x_col = X[:, feature_index]
uniq = np.unique(x_col)
if uniq.size <= 1:
continue
thresholds = (uniq[:-1] + uniq[1:]) / 2.0
for t in thresholds:
left = x_col <= t
n_left = int(left.sum())
n_right = n_samples - n_left
if n_left < self.min_samples_leaf or n_right < self.min_samples_leaf:
continue
gain = information_gain(y, y[left], y[~left], self._impurity)
if gain > best_gain:
best_gain = gain
best_feature = int(feature_index)
best_threshold = float(t)
return best_feature, best_threshold, float(best_gain)
def _build(self, X: np.ndarray, y: np.ndarray, depth: int) -> Node:
counts = np.bincount(y, minlength=len(self.classes_))
proba = counts / counts.sum() if counts.sum() else np.zeros_like(counts, dtype=float)
predicted_class = int(np.argmax(counts)) if counts.size else 0
node = Node(value=self.classes_[predicted_class], proba=proba)
if self.max_depth is not None and depth >= self.max_depth:
return node
if y.size < self.min_samples_split:
return node
if np.unique(y).size == 1:
return node
feature, threshold, gain = self._best_split(X, y)
if feature is None or threshold is None or gain <= 0.0:
return node
left_mask = X[:, feature] <= threshold
node.feature_index = feature
node.threshold = threshold
node.left = self._build(X[left_mask], y[left_mask], depth + 1)
node.right = self._build(X[~left_mask], y[~left_mask], depth + 1)
return node
def fit(self, X: np.ndarray, y: np.ndarray):
X = np.asarray(X, dtype=float)
y_raw = np.asarray(y)
self.classes_, y_enc = np.unique(y_raw, return_inverse=True)
self.root_ = self._build(X, y_enc, depth=0)
return self
def _predict_one(self, x_row: np.ndarray) -> object:
node = self.root_
while node is not None and node.feature_index is not None:
node = node.left if x_row[node.feature_index] <= node.threshold else node.right
return node.value
def predict(self, X: np.ndarray) -> np.ndarray:
X = np.asarray(X, dtype=float)
return np.array([self._predict_one(row) for row in X])
print('Scratch tree ready')
Scratch tree ready
Visual demo: scratch tree vs sklearn tree (classification)#
We’ll use make_moons (a classic non-linear dataset), and compare:
a shallow DecisionTreeClassifierScratch
an sklearn.tree.DecisionTreeClassifier with the same settings
We’ll plot the decision boundary.
def plot_decision_boundary(model, X: np.ndarray, y: np.ndarray, title: str, grid_steps: int = 250):
x_min, x_max = X[:, 0].min() - 0.6, X[:, 0].max() + 0.6
y_min, y_max = X[:, 1].min() - 0.6, X[:, 1].max() + 0.6
xs = np.linspace(x_min, x_max, grid_steps)
ys = np.linspace(y_min, y_max, grid_steps)
xx, yy = np.meshgrid(xs, ys)
grid = np.column_stack([xx.ravel(), yy.ravel()])
z = model.predict(grid).reshape(xx.shape)
fig = go.Figure()
fig.add_trace(
go.Contour(
x=xs,
y=ys,
z=z,
showscale=False,
opacity=0.35,
contours=dict(showlines=False),
colorscale=[[0, "#7aa6ff"], [1, "#ff8c7a"]],
)
)
fig.add_trace(
go.Scatter(
x=X[:, 0],
y=X[:, 1],
mode="markers",
marker=dict(size=6, color=y, colorscale=[[0, "#1f77b4"], [1, "#d62728"]], line=dict(width=0)),
name="data",
)
)
fig.update_layout(title=title, xaxis_title="x1", yaxis_title="x2")
return fig
X_moons, y_moons = make_moons(n_samples=450, noise=0.25, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_moons, y_moons, test_size=0.25, random_state=42)
scratch_tree = DecisionTreeClassifierScratch(max_depth=4, min_samples_leaf=5, criterion="gini", random_state=42)
scratch_tree.fit(X_tr, y_tr)
sk_tree = DecisionTreeClassifier(max_depth=4, min_samples_leaf=5, criterion="gini", random_state=42)
sk_tree.fit(X_tr, y_tr)
acc_scratch = accuracy_score(y_te, scratch_tree.predict(X_te))
acc_sklearn = accuracy_score(y_te, sk_tree.predict(X_te))
acc_scratch, acc_sklearn
(0.911504424778761, 0.911504424778761)
What did the tree learn?#
A nice sanity check is to print the learned rules.
sklearn provides export_text(...)
for our scratch tree, we’ll print a tiny readable representation
def print_scratch_tree(node, feature_names=("x1", "x2"), indent=""):
if node.feature_index is None:
if node.proba is None:
print(f"{indent}Predict: {node.value}")
else:
p = np.array2string(node.proba, precision=2, floatmode='fixed')
print(f"{indent}Predict: {node.value} (proba={p})")
return
name = feature_names[node.feature_index] if feature_names else f"x[{node.feature_index}]"
print(f"{indent}if {name} <= {node.threshold:.3f}:")
print_scratch_tree(node.left, feature_names=feature_names, indent=indent + " ")
print(f"{indent}else:")
print_scratch_tree(node.right, feature_names=feature_names, indent=indent + " ")
print("Scratch tree rules:")
print()
print_scratch_tree(scratch_tree.root_)
print()
print("---")
print()
print("sklearn tree rules:")
print()
print(export_text(sk_tree, feature_names=["x1", "x2"]))
Scratch tree rules:
if x2 <= 0.230:
if x1 <= -0.476:
Predict: 0 (proba=[1.00 0.00])
else:
if x2 <= -0.055:
if x2 <= -0.352:
Predict: 1 (proba=[0.00 1.00])
else:
Predict: 1 (proba=[0.05 0.95])
else:
if x1 <= 1.479:
Predict: 1 (proba=[0.34 0.66])
else:
Predict: 1 (proba=[0.00 1.00])
else:
if x1 <= 1.378:
if x2 <= 0.536:
if x1 <= 0.537:
Predict: 0 (proba=[0.70 0.30])
else:
Predict: 0 (proba=[1.00 0.00])
else:
if x2 <= 0.757:
Predict: 0 (proba=[0.91 0.09])
else:
Predict: 0 (proba=[1.00 0.00])
else:
Predict: 1 (proba=[0.00 1.00])
---
sklearn tree rules:
|--- x2 <= 0.23
| |--- x1 <= -0.48
| | |--- class: 0
| |--- x1 > -0.48
| | |--- x2 <= -0.06
| | | |--- x2 <= -0.35
| | | | |--- class: 1
| | | |--- x2 > -0.35
| | | | |--- class: 1
| | |--- x2 > -0.06
| | | |--- x1 <= 1.48
| | | | |--- class: 1
| | | |--- x1 > 1.48
| | | | |--- class: 1
|--- x2 > 0.23
| |--- x1 <= 1.38
| | |--- x2 <= 0.54
| | | |--- x1 <= 0.54
| | | | |--- class: 0
| | | |--- x1 > 0.54
| | | | |--- class: 0
| | |--- x2 > 0.54
| | | |--- x2 <= 0.76
| | | | |--- class: 0
| | | |--- x2 > 0.76
| | | | |--- class: 0
| |--- x1 > 1.38
| | |--- class: 1
fig1 = plot_decision_boundary(scratch_tree, X_te, y_te, f"Scratch DecisionTreeClassifier (acc={acc_scratch:.3f})")
fig1.show()
fig2 = plot_decision_boundary(sk_tree, X_te, y_te, f"sklearn DecisionTreeClassifier (acc={acc_sklearn:.3f})")
fig2.show()
4.2 Decision tree regressor (variance reduction)#
For regression, the “purity” is about how tightly clustered the target values are.
leaf prediction is usually the mean of \(y\) in that leaf
the split tries to reduce weighted variance
A tree regressor produces a piecewise constant function.
class DecisionTreeRegressorScratch:
def __init__(
self,
*,
max_depth: int | None = None,
min_samples_split: int = 2,
min_samples_leaf: int = 1,
max_features=None,
random_state: int = 0,
):
self.max_depth = max_depth
self.min_samples_split = min_samples_split
self.min_samples_leaf = min_samples_leaf
self.max_features = max_features
self.random_state = random_state
self.root_: Node | None = None
self._rng = np.random.default_rng(random_state)
def _best_split(self, X: np.ndarray, y: np.ndarray) -> tuple[int | None, float | None, float]:
n_samples, n_features = X.shape
if n_samples < self.min_samples_split:
return None, None, 0.0
parent_var = variance(y)
if parent_var == 0.0:
return None, None, 0.0
k = _resolve_max_features(self.max_features, n_features)
feature_indices = self._rng.choice(n_features, size=k, replace=False)
best_reduction = 0.0
best_feature = None
best_threshold = None
for feature_index in feature_indices:
x_col = X[:, feature_index]
uniq = np.unique(x_col)
if uniq.size <= 1:
continue
thresholds = (uniq[:-1] + uniq[1:]) / 2.0
for t in thresholds:
left = x_col <= t
n_left = int(left.sum())
n_right = n_samples - n_left
if n_left < self.min_samples_leaf or n_right < self.min_samples_leaf:
continue
child_var = (n_left / n_samples) * variance(y[left]) + (n_right / n_samples) * variance(y[~left])
reduction = parent_var - child_var
if reduction > best_reduction:
best_reduction = float(reduction)
best_feature = int(feature_index)
best_threshold = float(t)
return best_feature, best_threshold, float(best_reduction)
def _build(self, X: np.ndarray, y: np.ndarray, depth: int) -> Node:
node = Node(value=float(np.mean(y)) if y.size else 0.0)
if self.max_depth is not None and depth >= self.max_depth:
return node
if y.size < self.min_samples_split:
return node
if variance(y) == 0.0:
return node
feature, threshold, reduction = self._best_split(X, y)
if feature is None or threshold is None or reduction <= 0.0:
return node
left_mask = X[:, feature] <= threshold
node.feature_index = feature
node.threshold = threshold
node.left = self._build(X[left_mask], y[left_mask], depth + 1)
node.right = self._build(X[~left_mask], y[~left_mask], depth + 1)
return node
def fit(self, X: np.ndarray, y: np.ndarray):
X = np.asarray(X, dtype=float)
y = np.asarray(y, dtype=float)
self.root_ = self._build(X, y, depth=0)
return self
def _predict_one(self, x_row: np.ndarray) -> float:
node = self.root_
while node is not None and node.feature_index is not None:
node = node.left if x_row[node.feature_index] <= node.threshold else node.right
return float(node.value)
def predict(self, X: np.ndarray) -> np.ndarray:
X = np.asarray(X, dtype=float)
return np.array([self._predict_one(row) for row in X], dtype=float)
print('Scratch regressor ready')
Scratch regressor ready
# 1D regression dataset: nonlinear signal
n = 260
x_reg = np.sort(rng.uniform(-3, 3, size=n))
y_reg = np.sin(x_reg) + 0.25 * rng.normal(size=n)
X_reg = x_reg.reshape(-1, 1)
fig = px.scatter(x=x_reg, y=y_reg, title="Regression dataset: y = sin(x) + noise", labels={"x": "x", "y": "y"})
fig.show()
# Fit trees with different depths to show piecewise-constant behavior
x_grid = np.linspace(x_reg.min(), x_reg.max(), 600)
X_grid = x_grid.reshape(-1, 1)
depths = [1, 2, 3, 5, 8]
fig = go.Figure()
fig.add_trace(go.Scatter(x=x_reg, y=y_reg, mode="markers", name="data", marker=dict(size=5)))
for d in depths:
tree = DecisionTreeRegressorScratch(max_depth=d, min_samples_leaf=6, random_state=42)
tree.fit(X_reg, y_reg)
y_hat = tree.predict(X_grid)
fig.add_trace(go.Scatter(x=x_grid, y=y_hat, mode="lines", name=f"depth={d}"))
fig.update_layout(title="Scratch regression trees: deeper = more flexible", xaxis_title="x", yaxis_title="prediction")
fig.show()
Compare with sklearn (regression)#
The overall shape should look similar: deeper trees fit more wiggles.
# sklearn decision tree regressor on the same 1D regression data
from sklearn.tree import DecisionTreeRegressor
fig = go.Figure()
fig.add_trace(go.Scatter(x=x_reg, y=y_reg, mode="markers", name="data", marker=dict(size=5)))
for d in [1, 2, 3, 5, 8]:
m = DecisionTreeRegressor(max_depth=d, min_samples_leaf=6, random_state=42)
m.fit(X_reg, y_reg)
fig.add_trace(go.Scatter(x=x_grid, y=m.predict(X_grid), mode="lines", name=f"sk depth={d}"))
fig.update_layout(title="sklearn regression trees", xaxis_title="x", yaxis_title="prediction")
fig.show()
Random forest regressor (variance reduction in regression)#
A single regression tree can be very “blocky”. A random forest regressor averages many trees and often produces a smoother approximation.
# Train/test split for the regression task
Xr_tr, Xr_te, yr_tr, yr_te = train_test_split(X_reg, y_reg, test_size=0.25, random_state=42)
single_reg = DecisionTreeRegressor(max_depth=8, min_samples_leaf=3, random_state=42)
rf_reg = RandomForestRegressor(
n_estimators=400,
max_depth=None,
max_features=1.0,
min_samples_leaf=3,
random_state=42,
)
single_reg.fit(Xr_tr, yr_tr)
rf_reg.fit(Xr_tr, yr_tr)
mse_single = mean_squared_error(yr_te, single_reg.predict(Xr_te))
mse_rf = mean_squared_error(yr_te, rf_reg.predict(Xr_te))
mse_single, mse_rf
(0.07297656765298927, 0.06900414442614336)
fig = go.Figure()
fig.add_trace(go.Scatter(x=x_reg, y=y_reg, mode="markers", name="data", marker=dict(size=5)))
fig.add_trace(go.Scatter(x=x_grid, y=single_reg.predict(X_grid), mode="lines", name=f"single tree (MSE={mse_single:.3f})"))
fig.add_trace(go.Scatter(x=x_grid, y=rf_reg.predict(X_grid), mode="lines", name=f"random forest (MSE={mse_rf:.3f})"))
fig.update_layout(title="Regression: single tree vs random forest", xaxis_title="x", yaxis_title="prediction")
fig.show()
5) scikit-learn trees + key hyperparameters#
scikit-learn uses an optimized CART implementation.
Common parameters:
criterion:
classifier: "gini" or "entropy" (also "log_loss" in newer versions)
regressor: "squared_error" (variance reduction)
max_depth: maximum depth (prevents overfitting)
min_samples_split, min_samples_leaf: prevent overly small leaves
max_features: random subset of features per split (key for random forests)
ccp_alpha: cost-complexity pruning (post-pruning)
Below is a quick look at a trained sklearn tree in text form.
sk_tree_small = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5, criterion="gini", random_state=42)
sk_tree_small.fit(X_tr, y_tr)
print(export_text(sk_tree_small, feature_names=["x1", "x2"]))
|--- x2 <= 0.23
| |--- x1 <= -0.48
| | |--- class: 0
| |--- x1 > -0.48
| | |--- x2 <= -0.06
| | | |--- class: 1
| | |--- x2 > -0.06
| | | |--- class: 1
|--- x2 > 0.23
| |--- x1 <= 1.38
| | |--- x2 <= 0.54
| | | |--- class: 0
| | |--- x2 > 0.54
| | | |--- class: 0
| |--- x1 > 1.38
| | |--- class: 1
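Of the parameters above, ccp_alpha is the only post-pruning control. sklearn exposes the whole pruning path via cost_complexity_pruning_path, so you can refit once per effective alpha and pick the best pruned tree. A minimal sketch (on a fresh make_moons split; the specific sizes and seeds are arbitrary):

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=400, noise=0.25, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Effective alphas at which subtrees of the full tree get pruned away
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_tr, y_tr)

# Refit once per alpha; larger alpha -> more pruning -> smaller tree
scores = {}
for a in path.ccp_alphas[:-1]:  # the last alpha collapses the tree to its root
    clf = DecisionTreeClassifier(ccp_alpha=a, random_state=0).fit(X_tr, y_tr)
    scores[float(a)] = clf.score(X_te, y_te)

best_alpha = max(scores, key=scores.get)
print(f"best ccp_alpha={best_alpha:.5f}, test acc={scores[best_alpha]:.3f}")
```

In practice you would select best_alpha with cross-validation rather than a single held-out split.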
6) Ensembles: bagging and random forests#
A single decision tree is a high-variance learner:
small data changes can produce a different tree
Bagging (“bootstrap aggregating”) reduces variance:
sample many bootstrap datasets
train a model on each
average (regression) or majority vote (classification)
Random forests add another trick:
at each split, only consider a random subset of features (
max_features)
This decorrelates the trees, making the average stronger.
Analogy:
One expert can be inconsistent.
A panel of experts who don’t all think the same way tends to be more reliable.
class RandomForestClassifierScratch:
def __init__(
self,
*,
n_estimators: int = 50,
max_depth: int | None = None,
min_samples_split: int = 2,
min_samples_leaf: int = 1,
max_features="sqrt",
random_state: int = 0,
):
self.n_estimators = n_estimators
self.max_depth = max_depth
self.min_samples_split = min_samples_split
self.min_samples_leaf = min_samples_leaf
self.max_features = max_features
self.random_state = random_state
self.trees_: list[DecisionTreeClassifierScratch] = []
self._rng = np.random.default_rng(random_state)
def fit(self, X: np.ndarray, y: np.ndarray):
X = np.asarray(X, dtype=float)
y = np.asarray(y)
n_samples = X.shape[0]
self.trees_ = []
for _ in range(self.n_estimators):
idx = self._rng.integers(0, n_samples, size=n_samples)
tree = DecisionTreeClassifierScratch(
criterion="gini",
max_depth=self.max_depth,
min_samples_split=self.min_samples_split,
min_samples_leaf=self.min_samples_leaf,
max_features=self.max_features,
random_state=int(self._rng.integers(0, 1_000_000_000)),
)
tree.fit(X[idx], y[idx])
self.trees_.append(tree)
return self
def predict(self, X: np.ndarray) -> np.ndarray:
X = np.asarray(X, dtype=float)
preds = np.vstack([tree.predict(X) for tree in self.trees_])
# Majority vote per column
out = []
for j in range(preds.shape[1]):
values, counts = np.unique(preds[:, j], return_counts=True)
out.append(values[int(np.argmax(counts))])
return np.array(out)
# Compare: single tree vs scratch forest
single = DecisionTreeClassifierScratch(max_depth=6, min_samples_leaf=2, random_state=42)
single.fit(X_tr, y_tr)
forest = RandomForestClassifierScratch(n_estimators=120, max_depth=8, min_samples_leaf=2, max_features="sqrt", random_state=42)
forest.fit(X_tr, y_tr)
acc_single = accuracy_score(y_te, single.predict(X_te))
acc_forest = accuracy_score(y_te, forest.predict(X_te))
acc_single, acc_forest
(0.8938053097345132, 0.9203539823008849)
fig = plot_decision_boundary(single, X_te, y_te, f"Scratch single tree (acc={acc_single:.3f})")
fig.show()
fig = plot_decision_boundary(forest, X_te, y_te, f"Scratch random forest (acc={acc_forest:.3f})")
fig.show()
Random forests in scikit-learn#
Key parameters you’ll actually tune:
n_estimators: number of trees
max_depth: tree depth
max_features: features per split ("sqrt" is a common default)
min_samples_leaf: makes trees less “spiky”
bootstrap: whether to bootstrap samples
oob_score: out-of-bag estimate (free-ish validation when bootstrapping)
rf = RandomForestClassifier(
n_estimators=300,
max_depth=None,
max_features="sqrt",
min_samples_leaf=2,
bootstrap=True,
oob_score=True,
random_state=42,
)
rf.fit(X_tr, y_tr)
acc_rf = accuracy_score(y_te, rf.predict(X_te))
acc_rf, rf.oob_score_
(0.9203539823008849, 0.9287833827893175)
# Feature importances (on this synthetic dataset they're not very meaningful, but the API is)
fig = px.bar(x=["x1", "x2"], y=rf.feature_importances_, title="RandomForest feature_importances_")
fig.update_layout(xaxis_title="feature", yaxis_title="importance")
fig.show()
7) Gradient boosting and modern libraries#
Bagging/forests mainly reduce variance.
Boosting mainly reduces bias.
The tutoring analogy#
Imagine a student learning math:
Tutor #1 explains the basics.
Tutor #2 focuses on what the student still gets wrong.
Tutor #3 focuses on the remaining mistakes.
Boosting is like that: it builds an additive model

\[
F_M(x) = F_0(x) + \nu \sum_{m=1}^{M} f_m(x),
\]

where each new tree \(f_m\) is trained to reduce the current errors, and \(\nu\) is the learning rate.
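To make “each new tree fixes the current errors” concrete, here is a minimal least-squares boosting loop. It is a sketch of the plain gradient boosting recipe (not any specific library's implementation), using shallow sklearn regression trees as weak learners on synthetic data:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(-3, 3, size=200))
y = np.sin(x) + 0.2 * rng.normal(size=200)
X = x.reshape(-1, 1)

nu, n_trees = 0.1, 100          # learning rate and number of boosting rounds
F = np.full_like(y, y.mean())   # F_0: the best constant model
trees = []
for _ in range(n_trees):
    residual = y - F            # negative gradient of squared error at F
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residual)
    F += nu * tree.predict(X)   # F_m = F_{m-1} + nu * f_m
    trees.append(tree)

mse_start = float(np.mean((y - y.mean()) ** 2))
mse_final = float(np.mean((y - F) ** 2))
print(f"train MSE: {mse_start:.3f} -> {mse_final:.3f}")
```

For squared error the negative gradient is exactly the residual, which is why “fit a tree to the residuals” and “gradient boosting” coincide in this case.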
Modern boosting libraries:
XGBoost: strong regularization, second-order optimization, robust defaults, great for tabular
LightGBM: histogram + leaf-wise growth, very fast on large data
CatBoost: excellent categorical handling (ordered target statistics), strong out-of-the-box performance
We’ll demonstrate boosting with sklearn and then map the ideas to the libraries.
# Boosting demo (sklearn HistGradientBoostingClassifier)
# Note: this model can work without scaling; we keep features as-is.
X_tr2, X_val2, y_tr2, y_val2 = train_test_split(X_tr, y_tr, test_size=0.25, random_state=7)
gb = HistGradientBoostingClassifier(
max_depth=3,
learning_rate=0.1,
max_iter=250,
random_state=42,
)
gb.fit(X_tr2, y_tr2)
val_losses = []
for proba in gb.staged_predict_proba(X_val2):
val_losses.append(log_loss(y_val2, proba))
best_iter = int(np.argmin(val_losses) + 1)
fig = px.line(x=np.arange(1, len(val_losses) + 1), y=val_losses, title=f"Boosting: validation log loss (best_iter={best_iter})")
fig.update_layout(xaxis_title="# trees (iterations)", yaxis_title="log loss")
fig.show()
best_iter
47
XGBoost vs LightGBM vs CatBoost (high-level)#
All three are gradient-boosted decision tree (GBDT) systems, but they differ in engineering choices.
Tree growth strategy#
XGBoost (often): level-wise growth (more balanced)
LightGBM: typically leaf-wise growth (can be faster/more accurate, can overfit if unconstrained)
CatBoost: commonly uses symmetric / oblivious trees (same split per depth level)
Categorical features#
CatBoost: built-in strong categorical handling (ordered target statistics)
XGBoost / LightGBM: can handle categoricals in newer versions/modes, but many workflows still rely on encoding
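As a tiny example of the encoding route, here is sklearn's OrdinalEncoder on a hypothetical categorical column (one common choice among several; whether ordinal codes are appropriate depends on the model and data):

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical categorical column, turned into integer codes before a GBDT
colors = np.array([["red"], ["green"], ["blue"], ["green"]])
enc = OrdinalEncoder()            # categories are sorted alphabetically by default
codes = enc.fit_transform(colors)
print(codes.ravel())              # blue=0, green=1, red=2
```

Tree models tolerate arbitrary ordinal codes better than linear models do, since a tree can split the codes apart, but target-based encodings (as in CatBoost) often work better for high-cardinality features.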
Practical hyperparameter map (mental model)#
| Concept | XGBoost | LightGBM | CatBoost |
|---|---|---|---|
| # trees | n_estimators | n_estimators | iterations |
| step size | learning_rate (eta) | learning_rate | learning_rate |
| tree size | max_depth | num_leaves (and max_depth) | depth |
| row sampling | subsample | subsample (bagging_fraction) | subsample |
| col sampling | colsample_bytree | colsample_bytree (feature_fraction) | rsm (colsample_bylevel) |
| L2 reg | reg_lambda | reg_lambda (lambda_l2) | l2_leaf_reg |
| L1 reg | reg_alpha | reg_alpha (lambda_l1) | (limited / different) |
Rule of thumb:
If you increase tree complexity, counterbalance with stronger regularization and/or smaller learning rate.
# Optional: check whether the big boosting libs are installed in this environment.
# (This notebook still makes sense if they're not installed.)
libs = {
"xgboost": "xgboost",
"lightgbm": "lightgbm",
"catboost": "catboost",
}
for name, mod in libs.items():
try:
m = __import__(mod)
ver = getattr(m, "__version__", "(unknown version)")
print(f"{name}: installed ({ver})")
except Exception as e:
print(f"{name}: not installed ({type(e).__name__}: {e})")
xgboost: not installed (ModuleNotFoundError: No module named 'xgboost')
lightgbm: not installed (ModuleNotFoundError: No module named 'lightgbm')
catboost: not installed (ModuleNotFoundError: No module named 'catboost')
Exercises#
Change max_depth of a tree and visualize how the decision boundary changes.
Increase noise in make_moons and compare single tree vs forest.
For boosting, try learning_rate=0.05 with more max_iter and see the validation curve.
References#
ISLR (James, Witten, Hastie, Tibshirani)
ESL (Hastie, Tibshirani, Friedman)
sklearn documentation: DecisionTree, RandomForest, HistGradientBoosting
XGBoost / LightGBM / CatBoost official docs